Smart Paper Notes

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan , Noah Brown , Justice Carbajal , Yevgen Chebotar , Joseph Dabis , Chelsea Finn , Keerthana Gopalakrishnan , Karol Hausman , Alex Herzog , Jasmine Hsu

Categories: Robotics | Artificial Intelligence | Natural Language Processing | Computer Vision | Machine Learning

2022-12-13

By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing or speech recognition, it remains to be shown in robotics, where the generalization capabilities of the models are particularly critical due to the difficulty of collecting real-world robotic data. We argue that one of the keys to the success of such general robotic models lies with open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the diverse robotic data. In this paper, we present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks. The project's website and videos can be found at robotics-transformer.github.io.


Self-supervised AutoFlow

Hsin-Ping Huang , Charles Herrmann , Junhwa Hur , Erika Lu , Kyle Sargent , Austin Stone , Ming-Hsuan Yang , Deqing Sun

Categories: Computer Vision

2022-12-04

Recently, AutoFlow has shown promising results on learning a training set for optical flow, but requires ground truth labels in the target domain to compute its search metric. Observing a strong correlation between the ground truth search metric and self-supervised losses, we introduce self-supervised AutoFlow to handle real-world videos without ground truth labels. Using self-supervised loss as the search metric, our self-supervised AutoFlow performs on par with AutoFlow on Sintel and KITTI where ground truth is available, and performs better on the real-world DAVIS dataset. We further explore using self-supervised AutoFlow in the (semi-)supervised setting and obtain competitive results against the state of the art.


Open-vocabulary Queryable Scene Representations for Real World Planning

Boyuan Chen , Fei Xia , Brian Ichter , Kanishka Rao , Keerthana Gopalakrishnan , Michael S. Ryoo , Austin Stone , Daniel Kappler

Categories: Robotics | Artificial Intelligence | Computer Vision

2022-09-20

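Large language models (LLMs) have unlocked new capabilities for task planning from human instructions. However, prior attempts to apply LLMs to real-world robotic tasks have been limited by the lack of grounding in the surrounding scene. In this paper, we develop NLMap, an open-vocabulary and queryable scene representation, to address this problem. NLMap serves as a framework for gathering contextual information and integrating it into LLM planners, allowing them to see and query the objects available in the scene before generating a context-conditioned plan. NLMap first establishes a natural-language-queryable scene representation using vision-language models (VLMs). An LLM-based object proposal module parses the instruction and proposes the objects involved, which are then used to query the scene representation for object availability and location. An LLM planner then plans with this information about the scene. NLMap allows robots to operate without a fixed list of objects or executable options, enabling real robot operation that previous methods could not achieve. Project website: https://nlmap-saycan.github.io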


Simple Open-Vocabulary Object Detection with Vision Transformers

Matthias Minderer , Alexey Gritsenko , Austin Stone , Maxim Neumann , Dirk Weissenborn , Alexey Dosovitskiy , Aravindh Mahendran , Anurag Arnab , Mostafa Dehghani , Zhuoran Shen

Categories: Computer Vision

2022-05-12

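Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yields consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.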


Conditional Object-Centric Learning from Video

Thomas Kipf , Gamaleldin F. Elsayed , Aravindh Mahendran , Austin Stone , Sara Sabour , Georg Heigold , Rico Jonschkowski , Alexey Dosovitskiy , Klaus Greff

Categories: Computer Vision | Machine Learning | Machine Learning (Statistics)

2021-11-24

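Object-centric representations are a promising path toward more systematic generalization by providing flexible abstractions upon which compositional world models can be built. Recent work on simple 2D and 3D datasets has shown that models with object-centric inductive biases can learn to segment and represent meaningful objects from the statistical structure of the data alone, without any supervision. However, such fully unsupervised methods still fail to scale to diverse realistic data, despite the use of increasingly complex inductive biases such as priors for the size of objects or the 3D geometry of the scene. In this paper, we instead take a weakly supervised approach and focus on how 1) using the temporal dynamics of video data in the form of optical flow and 2) conditioning the model on simple object location cues can be used to enable segmenting and tracking objects in significantly more realistic synthetic data. We introduce a sequential extension to Slot Attention, which we train to predict optical flow for realistic-looking synthetic scenes, and show that conditioning the initial state of this model on a small set of hints, such as the center of mass of objects in the first frame, is sufficient to significantly improve instance segmentation. These benefits generalize beyond the training distribution to novel objects, novel backgrounds, and longer video sequences. We also find that such initial-state conditioning can be used during inference as a flexible interface to query the model for specific objects or parts of objects, which could pave the way for a range of weakly supervised approaches and allow more effective interaction with trained models.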


HandsOff: Labeled Dataset Generation With No Additional Human Annotations

Austin Xu , Mariya I. Vasileva , Achal Dave , Arjun Seshadri

Categories: Computer Vision | Machine Learning

2022-12-24

Recent work leverages the expressive power of generative adversarial networks (GANs) to generate labeled synthetic datasets. These dataset generation methods often require new annotations of synthetic images, which forces practitioners to seek out annotators, curate a set of synthetic images, and ensure the quality of generated labels. We introduce the HandsOff framework, a technique capable of producing an unlimited number of synthetic images and corresponding labels after being trained on fewer than 50 pre-existing labeled images. Our framework avoids the practical drawbacks of prior work by unifying the field of GAN inversion with dataset generation. We generate datasets with rich pixel-wise labels in multiple challenging domains such as faces, cars, full-body human poses, and urban driving scenes. Our method achieves state-of-the-art performance in semantic segmentation, keypoint detection, and depth estimation compared to prior dataset generation approaches and transfer learning baselines. We additionally showcase its ability to address broad challenges in model development that stem from fixed, hand-annotated datasets, such as the long-tail problem in semantic segmentation.


Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features

Vivek Rathod , Bryan Seybold , Sudheendra Vijayanarasimhan , Austin Myers , Xiuye Gu , Vighnesh Birodkar , David A. Ross

Categories: Computer Vision

2022-12-20

Detecting actions in untrimmed videos should not be limited to a small, closed set of classes. We present a simple yet effective strategy for open-vocabulary temporal action detection utilizing pretrained image-text co-embeddings. Despite being trained on static images rather than videos, we show that image-text co-embeddings enable open-vocabulary performance competitive with fully supervised models. We show that the performance can be further improved by ensembling the image-text features with features encoding local motion, like optical-flow-based features, or other modalities, like audio. In addition, we propose a more reasonable open-vocabulary evaluation setting for the ActivityNet data set, where the category splits are based on similarity rather than random assignment.
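
The abstract does not spell out the detection pipeline, so the following is only a minimal sketch of the general idea of open-vocabulary scoring with image-text co-embeddings: each frame embedding is compared with the text embedding of every candidate label by cosine similarity, and the best match is kept. The embeddings are hand-written toy vectors, and the names `cosine` and `score_frames` are hypothetical, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def score_frames(frame_embeddings, text_embeddings):
    """For each frame, return (best label, cosine similarity to that label)."""
    results = []
    for frame in frame_embeddings:
        sims = {label: cosine(frame, text) for label, text in text_embeddings.items()}
        best = max(sims, key=sims.get)
        results.append((best, sims[best]))
    return results


# Toy stand-ins for embeddings that a pretrained image-text model would produce.
text_embeddings = {
    "running": [1.0, 0.0, 0.0],
    "swimming": [0.0, 1.0, 0.0],
}
frame_embeddings = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.1, 0.9, 0.0],
]

labels = [label for label, _ in score_frames(frame_embeddings, text_embeddings)]
print(labels)
```

Because the label set is just a dictionary of text embeddings, new classes can be added at query time without retraining, which is the property the open-vocabulary setting relies on. Turning per-frame scores into temporal segments is a separate step that this sketch does not cover.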


Safe Evaluation For Offline Learning: Are We Ready To Deploy?

Hager Radi , Josiah P. Hanna , Peter Stone , Matthew E. Taylor

Categories: Machine Learning | Artificial Intelligence

2022-12-16

The world currently offers an abundance of data in multiple domains, from which we can learn reinforcement learning (RL) policies without further interaction with the environment. Learning RL agents offline from such data is possible, but deploying them while they are still learning might be dangerous in domains where safety is critical. Therefore, it is essential to find a way to estimate how a newly learned agent will perform if deployed in the target environment before actually deploying it and without the risk of overestimating its true performance. To achieve this, we introduce a framework for safe evaluation of offline learning using approximate high-confidence off-policy evaluation (HCOPE) to estimate the performance of offline policies during learning. In our setting, we assume a source of data, which we split into a train-set, to learn an offline policy, and a test-set, to estimate a lower bound on the offline policy using off-policy evaluation with bootstrapping. A lower-bound estimate tells us how well a newly learned target policy would perform before it is deployed in the real environment, and therefore allows us to decide when to deploy our learned policy.
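
As a rough illustration of off-policy evaluation with bootstrapping, the sketch below computes importance-sampled returns on held-out trajectories and takes a low percentile of the bootstrap distribution of their mean as an approximate lower bound. The abstract does not say which estimator or bootstrap variant the authors use, so ordinary importance sampling and the percentile bootstrap are assumptions here, as are the function names and the one-state toy problem.

```python
import random


def is_return(trajectory, target_prob, behavior_prob, gamma=1.0):
    """Ordinary importance-sampled return of one trajectory.

    trajectory is a list of (state, action, reward) tuples.
    """
    weight = 1.0
    ret = 0.0
    for t, (state, action, reward) in enumerate(trajectory):
        weight *= target_prob(state, action) / behavior_prob(state, action)
        ret += (gamma ** t) * reward
    return weight * ret


def bootstrap_lower_bound(estimates, delta=0.05, n_boot=2000, seed=0):
    """Percentile-bootstrap lower bound on the mean of the estimates.

    Returns the delta-quantile of the resampled means, i.e. an approximate
    (1 - delta) lower confidence bound.
    """
    rng = random.Random(seed)
    n = len(estimates)
    means = []
    for _ in range(n_boot):
        sample = [estimates[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    return means[int(delta * n_boot)]


# Toy test-set: one state, two actions, reward equals the action taken.
def behavior_prob(state, action):
    return 0.5  # the data was collected by a uniform policy


def target_prob(state, action):
    return 0.8 if action == 1 else 0.2  # learned policy prefers action 1


rng = random.Random(1)
test_set = []
for _ in range(200):
    action = 1 if rng.random() < 0.5 else 0
    test_set.append([(0, action, float(action))])

estimates = [is_return(traj, target_prob, behavior_prob) for traj in test_set]
mean_estimate = sum(estimates) / len(estimates)
lower = bootstrap_lower_bound(estimates)
print(mean_estimate, lower)
```

A deployment rule in this spirit would compare `lower` against a required performance threshold and deploy only once the bound clears it. The bootstrap bound is approximate, which matches the "approximate" qualifier in the abstract but offers no strict guarantee.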


ABC: Adversarial Behavioral Cloning for Offline Mode-Seeking Imitation Learning

Eddy Hudson , Ishan Durugkar , Garrett Warnell , Peter Stone

Categories: Machine Learning | Artificial Intelligence

2022-11-08

Given a dataset of expert agent interactions with an environment of interest, a viable method to extract an effective agent policy is to estimate the maximum likelihood policy indicated by this data. This approach is commonly referred to as behavioral cloning (BC). In this work, we describe a key disadvantage of BC that arises due to the maximum likelihood objective function; namely that BC is mean-seeking with respect to the state-conditional expert action distribution when the learner's policy is represented with a Gaussian. To address this issue, we introduce a modified version of BC, Adversarial Behavioral Cloning (ABC), that exhibits mode-seeking behavior by incorporating elements of GAN (generative adversarial network) training. We evaluate ABC on toy domains and a domain based on Hopper from the DeepMind Control suite, and show that it outperforms standard BC by being mode-seeking in nature.
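
The mean-seeking behavior described above can be seen in a small, self-contained example. Assume a single state in which the expert chooses actions near -1 or near +1 with equal probability (a made-up bimodal distribution, not data from the paper). The maximum likelihood fit of a single Gaussian places its mean between the two modes, at an action the expert essentially never takes.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical expert: in one fixed state, acts near -1 or near +1.
expert_actions = [
    rng.gauss(-1.0, 0.1) if rng.random() < 0.5 else rng.gauss(1.0, 0.1)
    for _ in range(1000)
]

# Maximum likelihood estimates for a single Gaussian policy:
# the sample mean and the (population) standard deviation.
mu = statistics.fmean(expert_actions)
sigma = statistics.pstdev(expert_actions)

# Distance from the fitted mean to the closest action the expert took.
nearest_expert_action = min(abs(a - mu) for a in expert_actions)

print(mu, sigma, nearest_expert_action)
```

The fitted mean lands near 0 and the fitted standard deviation is close to 1, so the most likely action under the cloned policy sits far from both expert modes. A mode-seeking learner would instead concentrate on one of the two modes, which is the behavior the abstract attributes to ABC.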


Improved Kidney Stone Recognition Through Attention and Multi-View Feature Fusion Strategies

Elias Villalvazo-Avila , Francisco Lopez-Tiro , Jonathan El-Beze , Jacques Hubert , Miguel Gonzalez-Mendoza , Gilberto Ochoa-Ruiz , Christian Daul

Categories: Computer Vision | Artificial Intelligence

2022-11-05

This contribution presents a deep learning method for the extraction and fusion of information relating to kidney stone fragments acquired from different viewpoints of the endoscope. Surface and section fragment images are jointly used during the training of the classifier to improve the discrimination power of the features by adding attention layers at the end of each convolutional block. This approach is specifically designed to mimic the morpho-constitutional analysis performed ex vivo by biologists to visually identify kidney stones by inspecting both views. The addition of attention mechanisms to the backbone improved the results of single-view extraction backbones by 4% on average. Moreover, in comparison to the state of the art, the fusion of the deep features improved the overall results by up to 11% in terms of kidney stone classification accuracy.
